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Abstract. Wc describe our software package Block Locally Optimal Preconditioned Eigenvalue 
Xolvers (BLOPEX) publicly released recently. BLOPEX is available as a stand-alone serial library, as 
an external package to PETSc ( "Portable, Extensible Toolkit for Scientific Computation" , a general 
purpose suite of tools for the scalable solution of partial differential equations and related problems 
developed by Argonne National Laboratory), and is also built into hypre ("High Performance Precon- 
ditioned", scalable linear solvers package developed by Lawrence Livermore National Laboratory). 
The present BLOPEX release includes only one solver — the Locally Optimal Block Preconditioned 
Conjugate Gradient (LOBPCG) method for symmetric eigenvalue problems, hypre provides users 
with advanced high-quality parallel preconditioners for linear systems, in particular, with domain 
decomposition and multigrid preconditioners. With BLOPEX, the same preconditioners can now 
be efficiently used for symmetric eigenvalue problems. PETSc facilitates the integration of indepen- 
dently developed application modules with strict attention to component interoperability, and makes 
BLOPEX extremely easy to compile and use with preconditioners that are available via PETSc. We 
present the LOBPCG algorithm in BLOPEX for hypre and PETSc. We demonstrate numerically the 
scalability of BLOPEX by testing it on a number of distributed and shared memory parallel systems, 
including a Beowulf system, SUN Fire 880, an AMD dual-core Opteron workstation, and IBM Bluc- 
Gene/L supercomputer, using PETSc domain decomposition and hypre multigrid preconditioning. 
We test BLOPEX on a model problem, the standard 7-point finite-difference approximation of the 
3-D Laplacian, with the problem size in the range 10 5 — 10 s . 
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1. Introduction. We describe a new software package Block Locally Optimal 
Preconditioned Eigenvalue Xolvers (BLOPEX) revision 1.0 publicly released recently. 
BLOPEX is available as a stand-alone serial library, which can be downloaded from 
http://matri.cvidenver.eclu/~aknyazev/software/BLOPEX/, as an external package 
to the Portable, Extensible Toolkit for Scientific Computation (PETSc) [2], and is 
built into the High Performance Preconditioners (hypre) library Falgout et al. [8, 9]. 
Jose Roman has also written a Scalable Library for Eigenvalue Problem Computations 
(SLEPc) [10]) interface to our hypre BLOPEX. 

The native hypre BLOPEX version efficiently and directly takes advantage of 
powerful hypre multigrid preconditioners, both algebraic (AMG, called BoomerAMG 
in hypre) and geometric or structured (SMG). BLOPEX in PETSc gives the PETSc 
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community easy access to the customizable code of a preconditioned cigcnsolver. 

At present, BLOPEX includes only one solver — the Locally Optimal Block Pre- 
conditioned Conjugate Gradient (LOBPCG) Knyazev [17, 18] method for eigenprob- 
lems Ax — XBx with large, possibly sparse, symmetric matrix A and symmetric 
positive definite matrix B. Preconditioned eigenvalue solvers in general, see Knyazev 
[16, 17], and in particular the LOBPCG method have recently attracted attention as 
potential competitors to (block-) Lanczos methods. LOBPCG has been implemented 
using different computer languages in a number of other software packages: CH — h in 
Anasazi Trilinos Arbenz et al. [1], Heroux et al. [11] and in NGSolve Zaglmayr [30, 
Sections 7.2-7.4]; C in PRIMME Stathopoulos and J. R. McCombs [25]; FORTRAN 
77 in PLTMG Bank [3]; FORTRAN 90 in ABINIT Bottin et at. [5] and in PESCAN 
Tomov et al. [26]; Python in SciPy by Robert Cimrman and in PYFEMax by Peter 
Arbenz and Roman Geus. LOBPCG has been independently tested in Yamada et al. 
[27, 28] for Fermion-Habbard Model on Earth Simulator CDIR/MPI; in Arbenz et al. 
[1] and Borzi and Borzi [4] using AMG preconditioning for vibration problems; and 
in Tomov et al. [26], Yang et al. [29] for electronic structure calculations. Our hypre 
BLOPEX version has been used in Bramble et al. [6] for the Maxwell problem. 

Section 2 contains a complete and detailed description of the LOBPCG algorithm 
as implemented in BLOPEX. We discuss our specific abstract implementation ap- 
proach in section 3. Parallel performance on distributed and shared memory systems 
using domain decomposition and multigrid preconditioning is discussed in section 4. 
We collect technical information, e.g., the list of acronyms, in the appendix. 

2. LOBPCG and its Implementations. In subsection 2.1, we briefly describe 
the ideas behind the LOBPCG method mostly following Knyazev [17, 18]. In subsec- 
tion 2.2, we present a complete description of the LOBPCG algorithm as implemented 
in BLOPEX. Deflation by hard and soft locking is suggested in subsection 2.3. 

2.1. LOBPCG Ideas and Description. 

2.1.1. The problem and the assumptions. We consider the problem of com- 
puting the m smallest eigenvalues and the corresponding eigenvectors of the general- 
ized eigenvalue problem Ax = XBx, with symmetric (Hcrmitian in the complex case) 
matrices A and B, the latter assumed to be positive definite. For a given approximate 
eigenvector x, we use the traditional Rayleigh quotient X(x) = (x, Ax)/(x, Bx) in the 
standard scalar product as an approximation to the corresponding eigenvalue. We 
emphasize that only matrix B is assumed to be positive definite. There is no such 
requirement on A; moreover, in all formulas in this section we can replace A with 
A + aB and the formulas are invariant with respect to a real shift a. 

To accelerate the convergence, we introduce a preconditioner T, which is typically 
a function that for a given vector x produces Tx. The existing theory, e.g., Knyazev 
[18], Knyazev and Neymeyr [21], of the LOBPCG method is based on several assump- 
tions: T needs to be linear, symmetric, and positive definite. Under these assumptions 
a specific convergence rate bound is proved in [18, 21]. Numerical tests suggest that 
all these requirements on T are necessary in order to guarantee this convergence rate. 
The use of preconditioning that does not satisfy the requirements above not only can 
slow down the convergence, but may also lead to a breakdown of the method and 
severe instabilities in the code resulting in incorrect results — that should not be mis- 
takenly interpreted as bugs in the code. The method is robust however if T changes 
from iteration to iteration as soon as in every step it satisfies the above assumptions. 
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2.1.2. Single- Vector LOBPCG. For computing only the smallest eigenpair, 
i.e. if m = 1, the LOBPCG method takes the form of a 3-term recurrence: 

x (i+i) = w (i) +T (*) x (i) +7 (*) x (i-i) j 

tuW = T(Ax® -\®Bx®), A« = A(a;«) = AcW)/(Ba?W,a?W) ( J 

with properly chosen scalar iteration parameters r'*), and 7^. The easiest and per- 
haps the most efficient choice of parameters is based on an idea of local optimality 
Knyazcv [14. 17, 18], namely, select and 7^ that minimize the Rayleigh quotient 
A(a;( 8+1 )) by using the Rayleigh- Ritz procedure in the three-dimensional trial subspace 
spanning xM, toW, and x^ % ~ x '. We do not describe the Rayleigh- Ritz procedure using 
an arbitrary basis of the trial subspace in this paper and assume that the reader is 
familiar with it. 

If 7W — 0, method (2.1) turns into the classical preconditioned locally optimal 
steepest descent method, so one can think of method (2.1) as an attempt to accelerate 
the steepest descent method by adding an extra term to the iterative recurrence; e.g., 
Knyazev [14, 17]. The general idea of accelerating an iterative sequence by involving 
the previous approximation is very natural and is well known. Let us note here that 
if instead of the previous approximation x^ -1 ' we use the previous preconditioned 
residual w' l ~ 1 ' in formula (2.1), we get a different method that in practical tests 
converges not as fast as method (2.1). 

When implementing the Raylcigh-Ritz procedure in the trial subspace spanning 
X W ; w w ( an d xt 1-1 ) ; W e need to be careful in choosing an appropriate basis that leads 
to reasonably conditioned Gram matrices. The current eigenvector approximation 
x^ and the previous eigenvector approximation x^~^' get closer to each other in 
the process of iterations, so special measures need to be taken in method (2.1) to 
overcome the potential numerical instability due to the round-off errors. Using both 
vectors x^> and cc^ -1 ) as basis vectors of the trial subspace leads to ill-conditioned 
Gram matrices, and the Rayleigh-Ritz method can produce spurious eigenpairs. 

A simple fix is suggested in Knyazev [18]: the three-term recurrence that contains 
the current eigenvector approximation, the preconditioned residual, and the implicitly 
computed difference between the current and the previous eigenvector approximations: 



X (i+V = w (i) +T (i) x (i) + 7 W p (t) ) w (i) =T(Ax ( V - A^BarW), 
p (i+i) = w (i) + 7 (%W ; p (o) = 0, A« = A(xW). 



(2.2) 



G span{u/ 4 \ x®, p {t) } = span{u?W, x®, a^" 1 '} as = a-(' i+1 ) -r^x^, 

therefore, the new formula (2.2) is mathematically equivalent to (2.1), in exact arith- 
metic. We choose in (2.2) the scalar iteration parameters t^' and 7 W as above, i.e. 
minimizing the Rayleigh quotient A(x^ +1 '). 

We note that formula (2.2) is not quite the ultimate fix, e.g., it is possible in 
principle that at the initial stage of the iterations is small, so that is too 

close to x(' l+1 \ and the Raylcigh-Ritz procedure may fail. The proper choice of the 
basis in the trial subspace span{tt)W, a;W,pW} f or Raylcigh-Ritz procedure in 
(2.2), having in mind that vectors and converge to zero, is not a trivial issue, 
see Hetmaniuk and Lehoucq [12] for possible causes of LOBPCG instability. 

The locally optimal choice of the step sizes in the LOBPCG allows an easy gen- 
eralization from the single-vector version (2.1) or (2.2) to the block version described 
next in subsection 2.1.3, where a block of m vectors is iterated simultaneously We 
return to the discussion of the choice of the basis in subsection 2.2, where it becomes 
even more vital with the increase of the block size m. 
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2.1.3. The LOBPCG Basics. A block version of LOBPCG for finding the 
m > 1 smallest eigenpairs is suggested in Knyazev [16, 17]: 

espanjxf" 1 ', T(A - \®B)x®, . . . , xg^, x® , T(A - X® B)x$} , 

where is computed as the j-th Ritz vector, j = 1, ... ,m corresponding to the 

j-th smallest Ritz value in the Rayleigh-Ritz procedure on a 3TO-dimensional trial 
subspace. The block method has the same problem of having close vectors in the trial 
subspacc as the single-vector version (2.1) discussed in the previous subsection, for 
which Knyazev [18] suggested the same fix using the directions p. 

As in other block methods, the block size should be chosen, if possible, to provide 
a large gap between first m eigenvalues and the rest of the spectrum as this typically 
leads to a better convergence, see Knyazev [18], Knyazev and Neymeyr [21], Ovtchin- 
nikov [24]. Block methods generally handle clusters in the spectrum and multiple 
eigenvalues quite well; and the LOBPCG is no exception. An attempt should be 
made to include the whole cluster of eigenvalues into the block, while for multiple 
eigenvalues this is not essential at all. The numerical experience is that the smallest 
eigenvalues often converge faster, and that the few eigenvalues with the noticeably 
slow convergence usually are the largest in the block and typically correspond to a 
cluster of eigenvalues that is not completely included in the block. A possible cure is 
to use a flexible block size a bit larger than the number of eigenpairs actually wanted 
and to ignore poorly converging eigenpairs. 

The block size m should not be so big that the costs of the Rayleigh-Ritz procedure 
in the 3m-dimcnsional trial subspace dominate the costs of iterations. If a large 
number of eigenpairs is needed, deflation is necessary, see subsection 2.3. The optimal 
choice of the block size m is problem-, software- and computer-dependent; e.g., it is 
affected by such details as the amount of the CPU cache and the use of optimized 
BLAS library functions in the multivector implementation, see section 3. 

2.2. The Detailed Description of the LOBPCG Algorithm. The descrip- 
tion of the LOBPCG algorithm as implemented in our BLOPEX-1.0 code follows: 
Input: m starting linearly independent vectors in I £ R™ xm , I linearly independent 
constraint vectors in Y E M. nxl , devices to compute A* X, B * X and T * X. 

1. Allocate memory W, P, Q, AX, AW, AP, BX, BW, BP E R nXm ,BY E R nxl . 

2. Apply the constraints to X: 

BY = B *Y; X = X -Y * (Y T * BY)" 1 * {{BY) T * X). 

3. B-orthonormalizc X: BX = B*X;R = chol(A T * BX);X = X * Rr 1 ; 

BX = BX * R- 1 ; AX = A*X. (Note: "chol" is the Cholesky decomposition). 

4. Compute the initial Ritz vectors: solve the eigenproblem 
(X T * AX) * TMP = TMP * A; 

and compute X = X * TMP; AX = AX * TMP; BX = BX * TMP. 

5. Define the index set J of active iterates to be {1, ... , m}. 

6. for k = 0, . . . , Max Iterations: 

7. Compute the residuals: Wj = AXj — BXj * Aj. 

8. Exclude from the index set J the indices that correspond to residual 
vectors for which the norm has become smaller than the tolerance. 
If J then becomes empty, exit loop. 

9. Apply the preconditioner T to the residuals: Wj = T * Wj. 
10 Apply the constraints to the preconditioned residuals Wj: 

Wj = Wj -Y * (Y T * BY)" 1 * ((BY) T * Wj). 
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11. Compute BWj and 5-orthonormalize Wj\ BWj = B * Wj 
R = chol(Wj * BWj); Wj = Wj* Rr 1 ; BWj = BWj * R~ 



12. 
13. 
14. 
15. 
16. 



17. 
18. 

19. 

20. 
21. 

22. 

23. 
24. 



25. 
20. 

27. 

28. 
29. 



30. 

31. 
32. 
33. 



Compute AWj: AWj = A*Wj. 
if k > 

B-orthonormalize Pj: R = chol(Pj * BPj); Pj 
Update APj = APj * Rr 1 ', BPj = BPj *Rr 1 . 
end if 

Perform the Rayleigh Ritz Procedure: 
Compute symmetric Gram matrices: 

if k >0 

A. X T * AWj 
gramA = ■ Wj * AWj 



P J *R- 1 ; 



gramB 



else 



gramA = 



I X T * BWj 



A X T * AWj 

■ Wj * AWj 



X T * APj 
Wj * APj 
Pj * APj 

X T * BPj 
Wj * BPj 

I 



I X 1 



*BWj 

I 



gramB = 
end if 

Solve the generalized eigenvalue problem: 

gramA * C = gramB * C * A, where the first m eigenvalues in 
increasing order are in the diagonal matrix A and the 
graTO-B-orthonormalizcd eigenvectors are the columns of C . 
Compute Ritz vectors: 
if k > 

" C' x 

Cw according to the number of columns in 
C P 



Partition C 



X, Wj, and Pj, respectively. 
Compute P = Wj * Cw + Pj * Cp; 

AP = AWj * Cw + APj * C P ; BP = BWj *C W + BPj * C P . 
X = X *C X + P;AX = AX *C X + AP;BX = BX *C X + BP. 



else 



Partition C 



according to the number of columns in 



C x 
Cw 

X and Wj respectively. 

P = Wj* C w ; AP = AWj * C W ;BP = BWj * C W - 
X = X *C X +P;AX = AX *C X + AP;BX = BX * C x + BP. 
end if 
37. end for 

Output: The eigenvectors X and the eigenvalues A. 



For description of all LOBPCG-BLOPEX variables see Appendix A. 

In the complex case, the transpose needs to be replaced the adjoint. Only double- 
precision real arithmetic is supported in the current revision 1.0 of BLOPEX. 
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The algorithm is matrix-free in the sense that the matrices A and B are not 
needed in the algorithm and not stored in the code, but rather are accessed only 
through matrix-vector product functions, provided by the user. Matrix-free codes are 
needed in applications where the matrices A and B are never explicitly formed, e.g., 
in finite element method software packages for partial differential equations. 

The algorithm uses only one block application of the preconditioner T (step 9) 
and one block matrix- vector product Bx (step 11) and Ax (step 12), per iteration. 
This is achieved by storing and manipulating 9m vectors (6m vectors if B = I). If 
there is not enough memory for 9m vectors, or if m is so large that linear algebra 
operations with 9m vectors are more costly than multiplication by A and B, e.g., if 
A and B are very sparse, the algorithm can be modified, so that only 3m vectors are 
stored, but then additional block matrix-vector products Ax and Bx are required per 
iteration. Such a modification is not yet currently available in BLOPEX. 

The issue of a good choice of the basis to use in the Rayleigh-Ritz method becomes 
even more delicate with a larger number m of vectors in the block. Ill-conditioned 
Gram matrices in the Rayleigh-Ritz procedure may result in algorithm failure, inac- 
curate computations of the Ritz pairs, or spurious Ritz pairs. As it can be seen in the 
LOBPCG algorithm in steps 11-22 we explicitly compute X, P, and W, and form the 
basis of the Rayleigh-Ritz procedure in the following way: X is left untouched, since 
it has already B-orthonormal columns, made of the Ritz vectors, while the vectors 
inside both W and P are B-orthonormalized in steps 11 and 14, correspondingly, so 
we put the identities on the block diagonal of the matrix gramB (step 19/22). 

For the B-orthonormalization of W and P in steps 11 and 14, the cheapest or- 
thonormalization technique, using the Cholcsky decomposition of the Gram matrix, 
has been chosen. This technique is known to be of a poor quality in the pres- 
ence of the round-off errors. Nevertheless, in our experience, the Cholesky-based 
B-orthonormalization of W and P reduces the risk of the Rayleigh-Ritz procedure 
failure while being considerably cheaper than the stable B-orthonormalization of all 
3m vectors in the basis of the trial subspace described, e.g., in [12]. 

Let us note that matrices P and W converge to zero, even worse, different columns 
of P and W converge to zero with different speeds, so matrices plugged in the Cholcsky 
decomposition are extremely badly scaled. It is crucial for the stability that the code of 
the Cholesky decomposition used here is numerically invariant with respect to matrix 
scaling, otherwise, the explicit normalization of columns of P and W is necessary prior 
to the call of the Cholcsky decomposition. Such a subtle property as scaling invariance 
that we rely upon here may apparently be affected by aggressive optimization flags 
during the compilation of the Cholesky decomposition code. 

We have also tested (in MATLAB) a version where the block diagonal of the 
matrix gramB is explicitly computed instead of just setting the blocks to be the 
identities. Such an approach is similar to a known idea of repeated orthonormalization 
and results in a better attainable accuracy. However, in our tests we have found 
the attainable accuracy of our Algorithm 2.2 to be already adequate, so this is not 
implemented in our BLOPEX LOBPCG. 

Numerical software development is often an act of a balanced compromise between 
performance, reliability, and accuracy. Performance has been our main priority in 
BLOPEX release 1.0, preserving reasonable reliability for common situations, and 
achieving the accuracy typical for large-scale applications in engineering. Design of 
other efficient and reliable implementations of the LOBPCG algorithm for BLOPEX 
is underway. 
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2.3. LOBPCG deflation by hard and soft locking. When several eigenvec- 
tors are computed simultaneously, it is often the case that some eigenvectors converge 
faster than others. To avoid the unnecessary computational work, one "locks" the 
eigenvectors that have already converged within a required tolerance while continuing 
to iterate the others. Locking approximately computed eigenvectors is not well stud- 
ied theoretically. In this subsection, we introduce hard and soft locking, and explain 
how locking is used in the LOBPCG algorithm, following the presentation of Knyazev 
[19]. The idea of soft locking is very natural and may have been known to some 
practitioners, developing codes for symmetric eigenvalue problems, even prior to [19], 
where the term "soft locking" has apparently been first introduced, but we are not 
aware of any earlier references. 

A commonly used locking technique, known as the deflation by restriction, is to 
keep the locked eigenvectors unchanged and those that are still iterated orthogonal 
to the locked ones. This technique is usually called "locking," but here it is referred 
to as "hard locking" to distinguish it from a new type of locking described next. 

A different technique, called here "soft locking" , can be used in simultaneous 
iterations combined with the Rayleigh-Ritz procedure, The idea of this technique is 
to remove the residuals of the locked vectors from the computation, but continue 
using the vectors themselves in the Rayleigh-Ritz procedure, which can change them. 
The orthogonality of the soft locked vectors to iterative vectors follows from the well- 
known fact that the Ritz vectors corresponding to different Ritz values are orthogonal, 
and those corresponding to multiple Ritz values can be chosen to be orthogonal. 

Compared to the hard locking, soft locking is computationally more expensive. 
The advantage of the soft locking is that it provides more accurate results. First and 
foremost, as the number of hard locked vectors increases, the attainable accuracy of 
the iterated ones may decrease, which may prevent them from reaching the required 
tolerance. With soft locking, the attainable accuracy of the iterated vectors does not 
depend on the number and the accuracy of soft locked ones. Second, if the Rayleigh 
quotient of fc-th locked vector is less than the (k + l)-th exact eigenvalue, then the 
accuracy of the first k locked vectors will increase in the course of the iterations, 
owing to the participation in the Rayleigh-Ritz procedure, even though their role in 
the iterative loop is now reduced to this procedure only. 

Below we describe the hard and soft locking as used in the BLOPEX imple- 
mentation of LOBPCG and present preliminary numerical results demonstrating the 
effectiveness of the soft locking compared to the traditional hard locking. 

Let us consider the single- vector LOBPCG method with no locking as described 
by (2.1) x^+V = w^+T^x^+^xV-^, roW = T(Ax® -\®Bx®) that we repeat 
here for the reader's convenience. Let P be a projector on the complement to the span 
of already computed eigenvectors. In LOBPCG with the hard locking, we redefine 
W W = PT(Ax^ — X^'Bx^), so that all iterates are now within the range of P if the 
initial guess in chosen within the range. Traditionally, P is orthogonal in the -B-based 
scalar product and the complement is orthogonal also in the i?-based scalar product, 
since this allows the easiest and most inexpensive implementation. This is what we 
use in step 10 of the LOBPCG algorithm in Subsection 2.2. The constraints, given 
by the columns of the matrix Y, are in this case the already approximately computed 
eigenvectors that we want to deflate. 

If the locked eigenvectors are computed exactly then it is easy to analyze the 
influence of classical hard locking and to show that it is equivalent to simply removing 
the locked eigenpairs from the consideration. If the locked eigenvectors are only 
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approximate, the range of P is not exactly an invariant subspace of the matrix pencil 
A — XB, and the analysis of the influence of the approximation quality on locking 
becomes much more complicated; sec e.g. D'yakonov and Knyazev [7]. 

As the number of locked vectors increases, their inaccuracy may affect the at- 
tainable accuracy of iterated vectors. To avoid this, in the LOBPCG algorithm in 
Subsection 2.2 we use soft locking. We denote by Xj the matrix that contains only the 
active (not soft locked) iterative eigenvector approximations as columns, while all m 
iterative vectors X arc included in the Raylcigh-Ritz procedure. Only Xj participates 
in iterations, i.e., having and xf we compute only wf = T {Axf - B xf ) , 

where A ^ is a diagonal matrix of approximate eigenvalues corresponding to Xj^ . We 
include the complete X, but only Wj and Pj, in the basis of the the Rayleigh-Ritz 
procedure in steps 18-22 and 26-32 of the LOBPCG algorithm in Subsection 2.2. 

The positive influence of the soft locking is especially noticeable when the stopping 
tolerance is not so small. Figure 2.1 presents the results of the soft locking when the 
residual norm reaches 1CP 1 in LOBPCG with m = 10 for a model problem. We 
observe that the active eigenpairs continue to converge and that the eigenpairs, which 
are already soft locked, still improve their accuracy. 



LO Block CG LO Block CG 




1 2 3 4 5 61234567 

Iteration number Iteration number 



Fig. 2.1. Residual norms (left) and eigenvalue errors (right) for 10 eigenpairs in 
LOBPCG with soft locking. Soft locked (the bottom 6) eigenpairs continue to improve. 

To summarize, we use the classical "hard" locking in LOBPCG through the con- 
straints, and soft locking inside the LOBPCG code. If the user wants to compute so 
many eigenpairs that it is not reasonable to include all of them at once into the block, 
the user should determine an appropriate block size m for the given hardware and 
call LOBPCG with the block size m several times, putting the previously computed 
eigenvectors into the constraints for the next LOBPCG call. The LOBPCG MATLAB 
code includes an example of such a call in the help information. 

3. Abstract BLOPEX implementation for PETSc and hypre. PETSC 
and hypre are software libraries aimed at solving large systems on massively parallel 
computers. Our native PETSc BLOPEX version gives the PETSc user community 
easy access to the customizable code of a modern preconditioned eigensolver and an 
opportunity to easily call hypre preconditioners from PETSc. The BLOPEX built-in 
hypre version efficiently takes direct advantage of powerful hypre AMG preconditioncr 
Boomer AMG and its geometric multigrid preconditioners PFMG and SMG. 



BLOPEX IN HYPRE AND PETSC 



9 



The BLOPEX library at present includes only the LOBPCG cigcnsolver. The 
BLOPEX code is written in C-language and calls a few LAPACK subroutines for 
dense matrix operations. The matrix-vector multiply and the preconditioner call are 
done through user supplied functions. The main LOBPCG code is abstract in the 
sense that it works only through an interface that determines the particular software 
environment: PETSc or hypre or user defined, in order to call parallel (multi)vcctor 
manipulation routines. A block diagram of the main modules is given in Figure 3.1. 





PETSc driver for LOBPCG 




hypre driver for LOBPCG 










Interface PETSc-BLOPEX 




Interface hypre-BLOPEX 








PETSc libraries Abstract BLOPEX hypre libraries 



Fig. 3.1. BLOPEX hypre and PETSc software modules 

In the abstract part of BLOPEX, the blocks of vectors (X, P etc) are represented 
by an abstract data object we call "multivector." A particular implementation of 
the multivector is outside of our code, so we only provide an interface to it. Ideally, 
all operations involving multivectors should be implemented in a block fashion, to 
minimize the number of exchanges between the processors and to take advantage of a 
highly optimized matrix-matrix multiplication routine dgemm in BLAS when comput- 
ing Gram matrices and updating multivectors. However, in PETSc and hypre such 
an object is currently not available to users to our knowledge, so we had to temporar- 
ily use the multivector interfaces as references to individual vectors. When properly 
implemented multivectors become available in PETSc and hypre, we plan to change 
our interface codes to use the true multivectors, which is expected to lead to a great 
speed-up due to a smaller number of communications and efficient BLAS use. 

4. BLOPEX LOBPCG Numerical Results in PETSc and hypre. Here we 
presents results accumulated over considerable length of time. Despite our efforts, 
the reader may still notice some inconsistency, e.g., some tests may use different 
tolerances. All tests are performed in double precision real arithmetic. 

4.1. The Basic Accuracy of Algorithm. In these tests BLOPEX LOBPCG 
computes the smallest 50 eigenvalues of 3D 7-point 200 x 200 x 200 and 200 x 201 x 202 
Laplacians using a direct application of the hypre BoomerAMG preconditioning. In 
the first case we have eigenvalues with multiplicity and in the second case the eigen- 
values are distinct, but clustered. The initial approximations are chosen randomly. 
We set the stopping tolerance (the norm of the maximum residual) equal to 10~ 6 . The 
numerical output (for the complete data, sec [20]) is compared to the exact eigenval- 



ues A 



i,3,k 



of the 3D 7-point 



Laplacians on the n x x n y x n z cube with 1 < i < n x , 1 < j < n y and 1 < k < n z 
In both tests for all eigenvalues the maximum relative error is less than 10 -8 , so 
LOBPCG is cluster robust, i.e. it does not miss (nearly) multiple eigenvalues, which 
supports the conclusion of Borzi and Borzi [4] . 
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4.2. Performance Versus the Number of Inner Iterations. We can exe- 
cute a preconditioner x = Tb directly or by calling PCG to solve a linear system, in 
our tests, Ax — b, so T approximates A^ 1 . Therefore, we can expect that increasing 
the number of "inner" iterations of the PCG might accelerate the overall convergence, 
but only if we do not make too many iterations, since even if T = A^ 1 , the conver- 
gence of the LOBPCG method is still linear. In other words, for a given matrix A 
and a particular choice of a preconditioner T, there should be an optimal finite (may 
be zero, corresponding to the direct application of T) number of inner iterations. 

We try to find this optimal number for the Schwarz-PCG and BoomcrAMG-PCG 
preconditioncrs in hypre and PETSc for the 7-point 3-D 100 x 100 x 100 Laplacian 
with the block-size m = 1. We measure the execution time as we vary the quality of 
the preconditioner by changing the maximum number of inner iterations in the cor- 
responding PCG solver. PETSc and hypre built-in Schwarz algorithms are different, 
but demonstrate a similar behavior. The hypre multigrid BoomerAMG and PFMG 
preconditioners are called from hypre test drivers. The hypre BoomerAMG precon- 
ditioner in addition is called through PETSc. The results for these tests involving 
different multigrid preconditioners are also similar. 

We find that for this problem the optimal number of inner iterations is approxi- 
mately 10-15 for Schwarz-PCG, but BoomerAMG-PCG works best if BoomerAMG is 
applied directly as a preconditioner, without even initializing the PCG solve function. 
Our explanation of this behavior is based on two facts. First, the Schwarz method 
with the default parameters used here is somewhat cheaper, but not of such a good 
quality, compared to BoomerAMG in these tests. Moreover, the costs for matrix- 
vector and multi-vector linear algebra in LOBPCG are relatively small compared to 
the costs of the BoomerAMG application, but is comparable to the costs of application 
of the Schwarz preconditioner here. Second, one PCG iteration is less computation- 
ally expensive compared to one LOBPCG iteration because of larger number of linear 
algebra operations with vectors in the latter. A single direct application of Boomer- 
AMG as the preconditioner in LOBPCG gives enough improvement in convergence 
to make it the best choice, while Schwarz requires more iterations that are less time 
consuming if performed using PCG, rather than by direct application in LOBPCG. 

In the next set of numerical tests for the same eigenproblem we use the block-size 
m = 10, and we vary the type of PCG preconditioner available in the hypre IJ interface 
together with the number of the inner PCG iterations. For the complete test data, see 
Knyazev and Argentati [20] ; here we provide only the summary. We observe that both 
the number of outer iterations and the execution time can be decreased significantly 
by choosing the maximum number of inner iterations optimally. For this problem, the 
BoomerAMG preconditioner applied directly has the fastest run time and converges in 
the smallest number of iterations, 31. For comparison, if we use no preconditioning at 
all then in 500 iterations that take ten times longer we still do not reach the required 
tolerance 10 -10 , which illustrates the importance of preconditioning. 

4.3. LOBPCG Performance vs. Block Size. We test hypre and PETSc 
LOBPCG on a 7-point 3D Laplacian with 128 x 128 x 128 w 2M unknowns using the 
hypre BoomerAMG preconditioner by increasing the block size m from 1 to 98. We 
run the tests on shared memory SUN Fire880 system, Dual and Single Core AMD 
Opterons Dual 275 CPUs servers, and a single I/O node of IBM BG/L with 32 CPUs. 
We find that the hypre and PETSc LOBPCG run times are very close, using the same 
tolerance 10 -8 , with PETSc extra overhead less than a few percents of the total times 
compared to a direct use of hypre, as expected. The growth of the CPU time with the 
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increase of the block size m is approximately linear for this problem up to m = 16, 
then the complexity term m 2 n becomes visible as well as the slower convergence rate 
for larger eigenvalues starts noticeably affecting the overall timing on Figure 4.1. 



x 10 



2.5 
2 

1.5 
1 

0.5 



■SUNFire 880 6 CPUs 
AMD dual 275 

AMD dual dual-core 275 / 1 

BL/G 1 I/O node, 32 CPUs 




40 60 
Block Size m 



80 



Fig. 4.1. LOBPCG Performance vs. Block Size. Preconditioner: hypre Boomer - 
AMG. 



4.4. Scalability. For the 7-point 3D Laplacian we vary the problem size propor- 
tional to the number of processors. We directly apply hypre Boomer AMG or PFMG 
multigrid prcconditioncrs. First, on the LLNL MCR cluster the eightfold increase of 
the number of dual-CPU nodes does not noticeably increase the LOBPCG CPU time, 
which is approximately 50 sec for BoomcrAMG and 180 sec for PFMG with m = 1 
and tolerance 10 -8 . The PFMG takes more time overall compared to BoomcrAMG 




Number of Nodes Number of Nodes 



Fig. 4.2. LOBPCG scalability, 1M unknowns per node on Beowulf. 

because of the larger convergence factor, but each iteration of the former is 50% faster 
than that of the latter, as we observe on Figure 4.2, which displays the CPU time per 
iteration for LOBPCG hypre code with hypre BoomerAMG and PFMG precondition- 
ers in our second test, on the Beowulf cluster. We see that the 32-fold increase in the 
number of dual-CPU nodes just slightly increases the LOBPCG CPU time. 
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# CPUs 


Matrix Size 


Setup (sec) 


# Iter. 


Solve time (sec.) 


8 


4.096 M 


7 


9/19 


62 / 580 


64 


32.768 M 


10 


8/17 


62 / 570 


512 


0.262144 B 


15 


7/13 


56 / 500 



Table 4.1 

Scalability for 80 x 80 x 80 = 512, 000 mesh per CPU, m = 1 / 6. 



# CPUs 


Matrix Size 


Setup (sec) 


# Iter. 


Solve time (thous. sec.) 


8 


1 M 


1 


34 


3 


64 


8 M 


5 


32 


3 


512 


64 M 


10 


29 


3 



Table 4.2 

Scalability for 50 x 50 x 50 = 125, 000 mesh per CPU, m = 50. 



Finally, we solve the 7-point 3D Laplacian eigenvalue problems for large matrix 
sizes and for a variety of block sizes m on the BG/L system. We observe a 10-20% 
variability in the iteration numbers and thus in the wall clock, due to the randomness 
of the initial approximations, so we report here typical average results. Tables 4.1 
and 4.2 contain the information about LOBPCG runs using hypre SMG geometric 
multigrid and tolerance 10~ 6 . 

Conclusions. The BLOPEX software has been integrated into the hypre library 
and is easily available in PETSc as an external package. BLOPEX is the only currently 
available software that solves eigenvalue problems using high quality existing hypre 
and PETSc preconditioners directly. Our abstract C implementation of the LOBPCG 
in BLOPEX allows easy deployment with different software packages. Our interface 
routines are based on hypre and PETSc standard interfaces and give the user an 
opportunity to provide matrix- vector multiply and preconditioned solver functions. 
The code scalability is tested for model problems of large sizes on parallel systems. 
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Appendix A. Description of all LOBPCG-BLOPEX variables. 

X (multivector). The iterative approximation to eigenvectors. 

A (function). Multiplies the function argument, which is a multivector, by the first 
matrix of the cigcnproblem. 

B (function). Multiplies the function argument, which is a multivector, by the second 
matrix of the cigcnproblem. 

T (function). Applies the preconditioner to the function argument, a multivector. 

Y (multivector). The given constraint that can be used for the hard locking. Itera- 
tions run in the S-orthogonal complement to the span of Y. 

A (diagonal matrix, in the actual code given by a vector of the diagonal entries). The 
iterative approximation to eigenvalues. 

W (multivector). Used to store the residuals (step 7) and the preconditioned residuals 
(step 9) and the preconditioned constrained residuals (step 10 and later). 

J (index vector). Denotes the set of indexes of active iterates. The starting value for 
J is {1, . . . , m}. If the norm of the residual rj for some j G J has become 
smaller than the tolerance, this index j is excluded from the set J. By Xj we 
denote a sub- matrix of X with the columns, which indexes are present in the 
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set J — these are the active iterates. The indexes from {1, . . . , m} that are not 
present in the set J correspond to iterative approximations to eigenvectors, 
which have been "soft locked." 
P (multivcctor). The LOBPCG "directions." 

gram A and gramB (matrices). The Gram matrices in the Raylcigh-Ritz procedure. 

C (matrix) . The matrix of the gramB-orthonormalizcd eigenvectors in the Raylcigh- 
Ritz procedure, used as the matrix of coefficients to compute the Ritz vectors, 
the new multivectors X, as linear combinations of the basis X, Wj, and Pj. 

AX, BX, AP, BP etc. (multivectors) . Implicitly computed results of the application 
of the corresponding functions to the corresponding vectors. In the absence 
of the round-off errors would be identical to their targets, e.g., AX = A*X. 

R (matrix). The upper triangular matrix of the Cholesky factorization, used for the 
J3-orthogonalization via Cholesky 

Appendix B. Installing and testing BLOPEX with PETSc and hypre. Here 
we give brief specific information for the reader who wants to install and test our 
BLOPEX software. This information concerns the current BLOPEX-1.0, PETSc- 
2.3.2, and hypre- 2. 0.0 releases. We refer to Appendix D for the acronyms and other 
names not explained here and suggest reading the software manuals for more details. 

To install and test BLOPEX and hypre in PETSc, download the PETSc source 
and follow the standard installation procedure with the external packages option. 
The present procedure is to add "— download-blopex=l — download- hyprc=l" options 
to the PETSc configuration. BLOPEX and hypre sources are then downloaded and 
compiled automatically. The BLOPEX test driver is already included in PETSc at 
the src/contrib/blopex/driver directory. In the following example: 
driver -da_grid_x 128 -da_grid_y 128 -da_grid_z 128 -ksp_type preonly -pc_type hypre 
-pc_hypre_type boomeramg -n_eigs 10 -seed 2 -tol le-6 -itr 5 -fulLout 1 
the eigenproblem for the 7-point 3D Laplacian is solved using the hypre BoomerAMG 
preconditioner directly. The last five parameters for BLOPEX are self-explanatory. 

BLOPEX LOBPCG is built into hypre and requires no special configuration op- 
tions, hypre supports four conceptual interfaces: STRUCT, SSTRUCT, FEM and IJ. 
LOBPCG has been tested with all but the FEM interface, hypre-2.0.0 test drivers 
for LOBPCG are in the sre/test directory. Examples with the same options as in the 
PETSc call above using IJ and STRUCT test drivers: 
ij -lobpeg -n 128 128 128 -pegitr -vrand 10 -seed 2 -tol le-6 -itr 5 -verb 1 
struct -lobpeg -n 128 128 128 -pegitr -vrand 10 -seed 2 -tol le-6 -itr 5 -verb 1 

Our code can, but is not really intended to, be used with the popular shift-and- 
invert strategy, since its main advantage compared to traditional eigensolvers is that 
it uses preconditioning. Our test drivers can call preconditioning directly ("-ksp_type 
preonly" PETSc option and "-pegitr 0" hypre option) as well as through calls to the 
/ij/pre/PETSc PCG method. In the latter case the action x = Tb of the preconditioner 
T on a given vector b is performed by calling a few steps of PCG to solve Ax = b. 
The BLOPEX LOBPCG code has been tested with all available hypre PCG-capable 
preconditioned in STRUCT, SSTRUCT and IJ interfaces, with the PETSc native 
Additive Schwarz, and with the PETSc linked BoomerAMG from hypre. 

Appendix C. Computers used for testing. 

AMD workstations Several single and dual core AMD two 64 bit 275 CPUs with 
16GB shared memory, running Linux, at CCM CU-Denver. 

SUN Fire 880 SUN Solaris 9, 6 UltraSPARC III 64 bit CPUs, 24GB shared mem- 
ory, at CCM CU-Denver. 
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Beowulf cluster Linux, 36 nodes, dual PHI 933MHz processors and 2GB memory 
per node with 7.2SCI Dolphin interconnect, at CCM CU-Denver. 

MCR cluster Linux, dual Xeon 2.4GHz 4 GB nodes with Quadrics, at LLNL. 

Blue Gene/L (BG/L) IBM single-rack with 1024 compute nodes, organized in 32 
I/O nodes with 32 compute nodes each. Every compute node is a dual-core 
chip, containing two 700MHz PowerPC-440 CPUs and 512MB of memory. We 
run using the default 1 CPU for computing and 1 CPU for communication 
on every compute node. Location: NCAR. 

Appendix D. Acronyms and other names. 

D.l. Organizations. 
CCM CU-Denver Center for Computational Mathematics, University of Colorado 

at Denver and Health Sciences Center, Downtown Denver Campus. 
LLNL Lawrence Livermore National Laboratory. 
NCAR National Center for Atmospheric Research. 

D.2. Software packages. 

BLOPEX Block Locally Optimal Preconditioned Eigenvalue Xolvers, a library for 
large scale eigenvalue problems, developed by the authors of the paper. 

PETSc "Portable, Extensible Toolkit for Scientific Computation," tools for the scal- 
able solution of partial differential equations and related problems developed 
by Argonne National Laboratory, see Balay et al. [2]. 

hypre "High Performance Preconditioncrs," scalable linear solvers package developed 
by LLNL, sec Falgout et al. [8, 9]. 

SLEPc "Scalable Library for Eigenvalue Problem Computations," a library for large 
eigenvalue problems, an extension of PETSc, sec Hernandez et al. [10]. 

D.3. hypre interfaces and preconditioners. 

IJ A Linear-Algebraic Interface for applications with sparse matrices. 

BoomerAMG a parallel implementation of an algebraic multigrid. 

Schwarz Agglomeration-based domain decomposition preconditioner. 
STRUCT A structured grid interface called that aims applications with logically 

rectangular grids. 

SMG A parallel scmicoarscning multigrid solver for the linear systems aris- 
ing from finite difference, finite volume, or finite element discretizations 
of the diffusion equation on logically rectangular grids. The code solves 
both 2D and 3D problems with discretization stencils of up to 9-point 
in 2D and up to 27-point in 3D. The algorithm semicoarsens in the 
z-direction and uses plane smoothing. 

PFMG PFMG is a parallel semicoarsening multigrid solver similar to SMG. 
The main difference between the two methods is in the smoother: PFMG 
uses simple pointwise smoothing. As a result, PFMG is not as robust 
as SMG, but is much more efficient per V-cycle. 
SSTRUCT A semi-structured grid interface named that targets applications with 

grids that are mostly structured, but with some unstructured features. 
FEI An unstructured grid interface designed for Finite Element applications. 

D.4. Methods. 

PCG Preconditioned Conjugate Gradient. 

LOBPCG Locally Optimal Block Preconditioned Conjugate Gradient. 
AMG Algebraic multigrid. 



